
Conversation

@oiffrig
Collaborator

@oiffrig oiffrig commented Jun 5, 2025

This PR is meant to be a discussion around supporting generic bootstrapping. Please consider this a proof of concept to get the discussion going.

Goal

The goal is to add a simple interface for bootstrapping score computations.

Suggested implementation

This PR creates a decorator called enable_bootstrap. When used, it turns a function into a Bootstrappable object whose __call__ method wraps the function. It also adds a bootstrap method that performs bootstrapping instead.
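For reference, here is a minimal sketch of what such a decorator could look like. This is a hypothetical simplification (two numpy inputs, resampling on one axis, fixed number of samples, plain ndarray result), not the PR's actual implementation:

```python
import numpy as np

class Bootstrappable:
    """Hypothetical sketch: wraps a score function and adds a .bootstrap() method."""

    def __init__(self, func, n_samples=100):
        self.func = func
        self.n_samples = n_samples

    def __call__(self, *args, **kwargs):
        # Normal call passes straight through to the wrapped function
        return self.func(*args, **kwargs)

    def bootstrap(self, x, y, axis=-1, rng=None):
        rng = np.random.default_rng(rng)
        n = x.shape[axis]
        samples = []
        for _ in range(self.n_samples):
            idx = rng.integers(0, n, size=n)  # resample indices with replacement
            samples.append(self.func(np.take(x, idx, axis=axis),
                                     np.take(y, idx, axis=axis)))
        return np.stack(samples)  # shape: (n_samples, ...)

def enable_bootstrap(func):
    return Bootstrappable(func)
```

A real implementation would also need to handle keyword arguments, per-input axes, xarray inputs, and a richer result object.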

Example

EDIT: replaced bad difference example with mse
EDIT2: added more examples

Decorator

@enable_bootstrap
def mse(x, y, axis=-1):
    return np.mean(np.square(y - x), axis=axis)

x = ...
y = ...
mse(x, y)  # normal call, returns MSE
bresult = mse.bootstrap(x, y)  # bootstrapping
bresult.quantiles(0.2)
bresult.sig_mask(0.05, 0.0)

Direct call

import numpy as np
from earthkit.meteo.score import pearson

x = ...
y = ...
bootstrapped = bootstrap(pearson, x, y)
np.quantile(bootstrapped, [0.05, 0.95], axis=0)

Direct call with xarray

import xarray as xr

x = xr.DataArray(..., dims=["number", ...])
bootstrapped = bootstrap((lambda arr: arr.mean("number")), x)
bootstrapped.quantile([0.05, 0.95], "sample")

Questions

  • The interface is so far limited to functions that take two (numpy-like) array inputs, with bootstrapping running on one axis of these arrays. Do we need to extend this?
    • Supporting xarray is probably a good idea, then we can bootstrap on named dimensions
    • DONE 780c0f3
  • The BootstrapResult contains the result for every bootstrapping sample. Should we support returning quantiles from bootstrap by default and only return the full set of samples if explicitly requested?
    • Alternative: just return the samples and let the user do their own statistics
  • Is this efficient enough for common use?
  • Do we need other statistics in the BootstrapResult?
    • Standard deviation?
  • Now that the bootstrap function is available, do we want to keep the decorator?

@codecov-commenter

codecov-commenter commented Jun 5, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.53%. Comparing base (80c220f) to head (780c0f3).
Report is 4 commits behind head on develop.

Additional details and impacted files
@@           Coverage Diff            @@
##           develop      #54   +/-   ##
========================================
  Coverage    99.53%   99.53%           
========================================
  Files           14       14           
  Lines          866      866           
  Branches        15       15           
========================================
  Hits           862      862           
  Misses           2        2           
  Partials         2        2           


@martin-janousek-ecmwf
Collaborator

Interesting! Would it work for the following use case?
Compute bootstrapped mean of ROC area

  1. Assume an xarray of contingency tables ct["lat","lon","number","date"], where the contingency table elements are stored under the dimension number (for CTs of a deterministic forecast the size of this dimension is 4, but for an ensemble it is (ens_size+1)*2, see https://sites.ecmwf.int/docs/vtb/annexes.html#annex-2-extended-contingency-table-structure).
  2. Function roc_area(ct,dim="number") returns array roca["lat","lon","date"] (https://github.com/ecmwf/vtb/blob/master/vtb/metricss/xmetrics.py#:~:text=roc_area)
  3. Then to compute a bootstrapped mean of ROC Area along the date dimension one has to
    • create sample of ct bootstrapped along the axis date
    • for each ct sample compute roca
    • compute mean of each roca along date
    • return 5-th and 95-th percentiles of roca means
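The four steps above could be sketched in plain numpy roughly as follows. Here roc_area is a placeholder reduction over the number axis (the real function lives in vtb), and the array sizes are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for roc_area: reduces the "number" axis.
def roc_area(ct, axis=2):
    return ct.mean(axis=axis)  # placeholder reduction, not the real ROC area

ct = rng.random((3, 4, 4, 30))  # (lat, lon, number, date), made-up sizes
n_samples = 200
n_dates = ct.shape[-1]

means = np.empty((n_samples,) + ct.shape[:2])     # (sample, lat, lon)
for i in range(n_samples):
    idx = rng.integers(0, n_dates, size=n_dates)  # 1. resample dates with replacement
    roca = roc_area(ct[..., idx])                 # 2. compute roca per sample
    means[i] = roca.mean(axis=-1)                 # 3. mean over date
low, high = np.percentile(means, [5, 95], axis=0)  # 4. percentiles of the means
```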

@Oisin-M
Contributor

Oisin-M commented Jun 10, 2025

@oiffrig this looks very cool!

On this question:

The interface is so far limited to functions that have two (numpy-like) array inputs and runs bootstrapping on one axis of these arrays. Do we need to extend this?

I think the answer is yes. We have functions where the bootstrapping axis isn't necessarily the same, e.g. score.crps has as input

  • x: array-like (n_ens, n_points)
  • y: array-like (n_points)

It would be a reasonable use case to want to bootstrap this over n_points, but these are on different axes (though of course one can hack it with axis=-1 for now).

We also have some functions that have more than 2 parameters. I give the example of wind.w_from_omega which has as input

  • omega : array-like
  • t : array-like
  • p : array-like

I also had a couple of thoughts/questions I thought I would throw out:

  • I was actually expecting to return directly the result for every bootstrapped sample, letting the user decide what they want to do after - which could be anything from a .mean(axis=0), quantiles, confidence interval, binning the samples etc. Are we sure that quantiles makes sense as a default?
  • Do we need some sort of nan_policy to decide whether or not we also bootstrap missing values? I think it would be good to handle this here as well.
  • I think we should raise a warning if the user tries to set n_samples larger than the input data to sample from. I'm not aware of a case where this makes sense.
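The sample-size warning suggested in the last bullet could look something like this. The helper name and signature are hypothetical, just to make the check concrete:

```python
import warnings
import numpy as np

def resample_indices(n_data, n_samples, sample_size=None, rng=None):
    """Hypothetical helper: draw bootstrap index sets with replacement."""
    rng = np.random.default_rng(rng)
    size = n_data if sample_size is None else sample_size
    if size > n_data:
        # Warn rather than fail: sampling more points than exist is
        # technically valid with replacement, but rarely what you want.
        warnings.warn(
            f"sample_size ({size}) is larger than the data size ({n_data})"
        )
    return rng.integers(0, n_data, size=(n_samples, size))
```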

@corentincarton
Collaborator

corentincarton commented Jun 10, 2025

Thanks @oiffrig, really interesting! I didn't know about bootstrapping so it's always good to learn :)

I'm a bit puzzled about the approach though. From what I see, the bootstrappable function that you use as an example here does not reduce the dataset in the axis direction, i.e.

f(x(n), y(n)) -> z(n)

Couldn't you then bootstrap the result directly instead of bootstrapping the function? That would make the whole process simpler, and would then work for @Oisin-M's and @martin-janousek-ecmwf's cases.

In other words, I feel the bootstrappable function should be the statistical function that reduces the array (here the array is the output of the metric function) over one or multiple dimensions. But those bootstrappable functions should not be the metrics that compare the experiment with the observation (CRPS, difference, bias, etc.).

I always visualise these evaluation workflows in three steps:

  1. Align experiment with observations (interpolation, slice, etc.), x(lat, lon), y(n) -> x(n), y(n)
  2. Compute metric between x and y, x(n), y(n) -> z(n)
  3. Compute statistics over one or multiple dimensions (here 1) z(n) -> w

In this case, the bootstrapping is happening at the third level, and should be a type of spatial/temporal statistics. If I understand correctly of course. Please correct me if I'm wrong!
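The three-step view could be sketched like this, bootstrapping the metric's output at step 3 rather than the metric itself (made-up data and a squared-error metric as an illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 2: pointwise metric z(n) = (y - x)**2, no reduction yet
x = rng.random(100)
y = x + 0.1 * rng.random(100)
z = np.square(y - x)

# Step 3: bootstrap the reduction (here a mean) over the result directly
n_boot = 500
idx = rng.integers(0, z.size, size=(n_boot, z.size))  # resample with replacement
stats = z[idx].mean(axis=1)                           # one mean per bootstrap sample
low, high = np.quantile(stats, [0.05, 0.95])
```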

@martin-janousek-ecmwf
Collaborator

I must admit I was initially confused, as my experience with bootstrapping was exactly along the lines Corentin very clearly described. But then I thought, oh, maybe the bootstrapping was meant as something even more general, like the difference() function from the initial comment, where bootstrapping the difference of two arrays was meant to be done by reshuffling one or both operands and then taking the difference? Which I found interesting, even though I personally was not aware of a use case for it.

@Oisin-M
Contributor

Oisin-M commented Jun 10, 2025

Ah yes, I got confused with the first example. I agree actually the difference function should not be bootstrappable, and neither should the score.crps or wind.w_from_omega examples I gave. The functions that should be bootstrappable are quantiles and sig_mask.

The only edge case I'm not sure about is if a function isn't deterministic, but I don't think we have such a case anyway.

@oiffrig
Collaborator Author

oiffrig commented Jun 16, 2025

Yes, I tried to come up with a quick example, and got confused. Do disregard the difference example. If you have some examples of bootstrapping I'd be happy to see them; after an offline discussion with @corentincarton and @Oisin-M there seems to be quite a range of use cases, and it's definitely worth considering the extent of what we want to support.

@martin-janousek-ecmwf

Interesting! Would it work for the following use case? Compute bootstrapped mean of ROC area

3. Then to compute a bootstrapped mean of ROC Area along the `date` dimension one has to
   
   * create sample of `ct` bootstrapped along the axis `date`
   * for each `ct` sample compute `roca`
   * compute mean of each `roca` along `date`
   * return 5-th and 95-th percentiles of `roca` means

Not as is, but it looks like something we should be able to support. We would need support for xarray (which looks more and more like the reasonable choice to make anyway), and a looser constraint on the number of inputs. Then, the updated bootstrapping would be applied to a function that takes ct and computes the mean of roca:

@enable_bootstrap(dim="date")
def roca_mean(ct, roca_dim="number", mean_dim="date"):
    return roca(ct, dim=roca_dim).mean(dim=mean_dim)

ct = contingency_table()
bresult = roca_mean.bootstrap(ct)
low, high = bresult.quantiles(0.05)

Does that look reasonable to you?

@Oisin-M
Contributor

Oisin-M commented Jun 17, 2025

Correct me if I'm wrong, but I think we don't want to bootstrap the roca_mean function, but rather just the mean function. It's not necessary/desirable to recalculate ROCA for each bootstrapped time sample, since the ROCA doesn't touch the time axis.

From what I understood, the calculation process would be

1. Get CT

ct = read_ct(data)
# dim:size  [lon: Nx, lat: Ny, number: M, date: T]

2. Calculate ROCA

roca = calculate_roca(ct, dim='number')
# dim:size [lat: Nx, lon: Ny, date: T]

Note: This would be bootstrappable along dimension number, but we only want to bootstrap along dimension date. Therefore, we should not be recalculating ROCA for each sampled date; it should be done once.

3. Bootstrap ROCA output

bootstrapped_roca = bootstrap_samples(roca, dim='date', sample_size=S, n_bootstraps=B)
# dim:size [lat: Nx, lon: Ny, date: S, sample: B]

4. Calculate mean per timestep

sample_means = calculate_mean(bootstrapped_roca, dim='date')
# dim:size [lat: Nx, lon: Ny, sample: B]

5. Calculate 5th and 95th quantiles over the samples

quantiles = calculate_quantiles(sample_means, q=[0.05, 0.95], dim='sample')
# dim:size [lat: Nx, lon: Ny, quantile: 2]

Sidenote: we could also bootstrap this along the dimension sample because it aggregates along that dimension. In practice, this would be pretty weird - we would essentially be trying to estimate the uncertainty in our estimated uncertainty of the mean.
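Steps 3-5 above could also be vectorized in plain numpy (assuming numpy-backed data and made-up sizes), resampling all bootstrap index sets at once:

```python
import numpy as np

rng = np.random.default_rng(0)
roca = rng.random((3, 4, 30))    # (lat, lon, date): ROCA computed once, upfront
S, B = 30, 200                   # sample size and number of bootstraps

idx = rng.integers(0, roca.shape[-1], size=(B, S))     # step 3: resample dates
boot = roca[..., idx]                                  # (lat, lon, sample, date)
sample_means = boot.mean(axis=-1)                      # step 4: mean per sample
qs = np.quantile(sample_means, [0.05, 0.95], axis=-1)  # step 5: (quantile, lat, lon)
```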


Assuming I haven't misunderstood things, I think in the snippet therefore it should be rather

@enable_bootstrap(dim="date")
def mean(roca_vals, mean_dim="date"):
    return roca_vals.mean(dim=mean_dim)

ct = contingency_table()
roca_vals = roca(ct, dim="number")
bresult = mean.bootstrap(roca_vals)
low, high = bresult.quantiles(0.05)

if we want to stick with returning this BootstrapResult class. Personally, I would be happier getting back the xarray directly, since the user may want to do some odd calculations afterwards. There is also already a .quantile method for xr DataArrays.

@martin-janousek-ecmwf
Collaborator

But ROCA is computed from a mean (or, equally, a sum) of CTs over time. We want to collect CTs at each station over a period of time and only then compute ROCA (and other CT-based scores like PSS, ETS, etc.).

@martin-janousek-ecmwf
Collaborator

Could it then work like

@enable_bootstrap(dim="date", bdim="sample")
def mean(ct, mean_dim="date"):
    return ct.mean(dim=mean_dim)

ct_vals = contingency_table()
bresult = mean.bootstrap(ct_vals)
roca_vals = roca(bresult, dim="number")
low, high = roca_vals.quantiles(0.05, dim="sample")

@Oisin-M
Contributor

Oisin-M commented Jun 17, 2025

Okay I see, thanks for the clarification @martin-janousek-ecmwf. I think then the snippet @oiffrig sent is the correct approach?

@enable_bootstrap(dim="date")
def roca_mean(ct, roca_dim="number", mean_dim="date"):
    return roca(ct, dim=roca_dim).mean(dim=mean_dim)

ct = contingency_table()
bresult = roca_mean.bootstrap(ct)
low, high = bresult.quantiles(0.05)

@martin-janousek-ecmwf
Collaborator

@enable_bootstrap(dim="date")
def roca_mean(ct, roca_dim="number", mean_dim="date"):
    return roca(ct, dim=roca_dim).mean(dim=mean_dim)

Sorry, but I still can't see this as correct. This hints that we first compute ROCA from the stations' CTs and only then take the mean. The order is the opposite: first we sum up the CTs, and only then do we apply roca(). Sum and roca are not commutative.

Wouldn't it rather be

@enable_bootstrap(dim="date")
def roca_mean(ct, roca_dim="number", mean_dim="date"):
    return roca(ct.mean(dim=mean_dim), dim=roca_dim)

@Oisin-M
Contributor

Oisin-M commented Jun 17, 2025

Thanks for the input @martin-janousek-ecmwf. I agree it's not commutative and the calculation order leads to different interpretations, but I do have some confusion around what we're trying to capture.

Happy to defer to your expertise, but doesn't averaging before computing ROCA lose temporal resolution? I would have thought we would want to compute ROCA per timestep so that we know the full distribution of ROCA values, and then take the mean of that distribution. The other way around, i.e. computing ROCA on the mean, would mean we only consider the "mean ct" and discard other information about the ct distribution, which I guess is why we would want to bootstrap.

But couldn't we just get the distribution of the roca values per timestep directly and find the mean and quantiles of that?

ct = get_ct()
roca_per_timestep = roca(ct) # distribution of roca values

# just directly find metrics about the distribution
mean = mean(roca_per_timestep, dim='time')
low, high = quantiles(roca_per_timestep, dim='time', qs=[0.05, 0.95])

@martin-janousek-ecmwf
Collaborator

Well, I suspect my "expertise" can be easily challenged, as I am not a statistician; here I am just hiding behind the fact that "we have always done it this way". And ROCA was probably a poor choice of example, as both the ROCA of a single step of a single forecast and the monthly mean ROCA of a single step make sense. A different score, like the ETS computed from a "deterministic" 2x2 contingency table, would have made my point better, as ETS is a case where one first has to accumulate CTs over some time period and only then can compute the derived statistic (ETS).

That nonetheless does not change the fact that more often than not we do not use an arithmetic average of "daily" scores to get their average value over a period of time (for the same lead time). We use the arithmetic mean for bias, but monthly RMSE is not an average of daily RMSEs but sqrt(mean(RMSE**2)); for ACC we don't average ACCs but their Fisher-transformed values. And monthly means of ROCA, ETS, PSS etc. are computed from month-accumulated CTs. And the biggest pain: the skill scores.
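To illustrate the RMSE point with made-up numbers: the arithmetic mean of daily RMSEs differs from the RMSE of the period, so the aggregation function matters.

```python
import numpy as np

daily_rmse = np.array([1.0, 2.0, 3.0])   # made-up daily RMSE values
naive_mean = daily_rmse.mean()            # 2.0: NOT the monthly RMSE
monthly_rmse = np.sqrt(np.mean(daily_rmse**2))  # sqrt(14/3), about 2.16
```

This is why a generic bootstrapping interface needs to let the user supply the correct reduction, not just assume an arithmetic mean.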

Honestly, these are kind of corner cases, always implemented only later on, but it may be useful to be aware of them as they can break the system.

@oiffrig
Collaborator Author

oiffrig commented Jun 18, 2025

Thanks both for your input. I'll try to refine the implementation based on this feedback.

Regarding skill scores, I would be keen on having a look at these use cases anyway, just to see how different they are. It would be great if we could support them out of the box!

@oiffrig
Collaborator Author

oiffrig commented Jun 24, 2025

Dear all,

I have now added support for xarray. I have also exposed a few functions, notably resample which returns input arrays for bootstrapping, and bootstrap which does the bootstrapping. I think this should cover a good range of use cases, but please do point me to other examples if you have specific use cases in mind.

Now that we have the bootstrap function, this puts into question the usefulness of the enable_bootstrap decorator. Is there any point in keeping it around? I feel like it would be better to just make the call to bootstrap explicit.

Also, at the moment bootstrap does not return a BootstrapResult. I feel like this is now quite superfluous, especially with xarray supporting a range of statistics out of the box anyway.
